Regulations for Platform-based "Suites of AI Technology"
Video Transcription
Now, moving on to our next speaker, it's Dr. Nicholas Petrick from the FDA. We had him on the panel earlier. He'll be talking about regulations for platform-based suites of AI technology. Thank you.

So I'll talk about the regulatory aspects. I'll give you a little bit of an overview of FDA and what we do, and then I'll talk specifically about how we've looked at regulating the CAD devices for endoscopy. Just some disclaimers and disclosures first.

Medical devices fall into different categories, from class 1 to class 3; that's how we regulate them, and the regulatory controls increase from class 1 to class 3. Class 1 may be things like gloves or manual stethoscopes, things along those lines. A lot of devices fall into what we call class 2—endoscopes tend to be in class 2—and the higher-risk devices end up in class 3. There are different data requirements associated with those classes, so I'll give you a quick overview.

Class 1 devices are subject to what we call general controls: general standards or information that's necessary for any device that goes onto the market. Most of those devices are considered low risk and are exempt from premarket data requirements, so we typically don't see a full submission coming in; they just need to meet a certain set of standards to move forward.

Class 2 devices, which again are the vast majority of devices that go onto the market, can come onto the market in two different ways. One is called a 510(k), or premarket notification. In this case, the goal is to show the new device is substantially equivalent to a predicate device—a device that's already on the market and available to use. Substantial equivalence is a somewhat nebulous term; it was deliberately left undefined when the law was written, in order to allow various types of device comparisons and to stay flexible across a large range of potential devices. So when there's something on the market and you're doing something basically equivalent to it, you would show substantial equivalence.

There's another way to get onto the market, and this is what happened with the GI Genius device: it went onto the market as a de novo, which means there was no device in that category already, but we wanted to put it into class 2 because we felt we had a regulatory approach for it. So the de novo is a means for a new device without a valid predicate to be classified directly into class 1 or class 2.

Finally, we have class 3 devices, the highest-risk devices. These need to demonstrate a reasonable assurance of safety and effectiveness, so they really have to stand on their own. If there's an existing PMA device with the same functionality as a new PMA, the new PMA still needs to establish on its own that it's safe and effective. So it's a different, higher-risk class, and sometimes—not always—different types of studies are done.

We already have one AI device category in endoscopy: the gastrointestinal lesion software detection system, which has the product code QNP if you look for it. This category was established through the de novo process.
You can read the definition, but it's a computer-aided detection device used in conjunction with endoscopy for the detection of abnormal lesions in the gastrointestinal tract. Again, it's an aided device: it's aiding the clinician in detecting lesions. And this is a category—we have three unique devices cleared under QNP. We have the original de novo, which was the GI Genius; there's the EndoScreener, which was cleared in, I believe, 2021; and SKOUT was just cleared, I think, last week. They weren't approved, they were cleared. So there are three devices in that category on the market now.

When we put a de novo into class 2, we establish something called special controls. You saw there are general controls—the general requirements like labeling and EMC that you need to meet. There are also special controls, and in this case they're very specific. There's a need to perform a clinical study when you put these on the market. There are requirements around non-clinical testing, and in particular—which I'll talk about a little—standalone algorithm performance testing. It's a really important component, and I want to make sure people don't underestimate its value: the clinical study is important, but this non-clinical testing is important too. It's also really important to understand that there can be a degradation in image quality when you add marks on top of the endoscopic examination, so image quality is another factor evaluated in these studies. Video delay is another: you can have delays associated with these devices that may actually impact performance, so that's another important factor to think about. I'm not going to go into a lot of detail on those. Usability assessment is part of it, as are EMC and electrical safety testing, and software verification and validation and hazard analysis—that's true for all software, but it's an important component.

And then labeling. There are multiple parts of labeling, and I just want to highlight one aspect, which is compatibility. Just because you get an AI cleared does not mean it's labeled for, and available with, every possible combination of processor and scope and everything else on the market. Variations in acquisition can actually have a strong impact on the AI. So it's the manufacturer's job to develop either a whitelist of which devices it is compatible with, or special requirements for what the acquisition needs to be in order to be compatible with that AI (see the whitelist sketch below). I just want to make sure people are aware that it's an important component, and a lot of times it's overlooked until the very end, when companies say, well, what's compatibility, and then we have to try to figure out how to handle it.

Just an example—hopefully this will run, maybe it won't, I'm not sure. Anyway, this is the GI Genius device, just to show you what one of these AIs does. In this case it's identifying a polyp here, and probably most people have seen at least examples of this before. I'm going to show you this again at the end, where I'll try to show you the complications of trying to benchmark the performance of an AI.
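As a toy illustration of the compatibility labeling point above, here is a minimal Python sketch of what a whitelist plus an acquisition requirement might look like. Every processor name, scope name, and resolution value here is hypothetical, invented purely for illustration; real compatibility claims come from the manufacturer's validated labeling.

```python
# Toy sketch of a compatibility whitelist with an acquisition requirement,
# as discussed above. All processor/scope names and values are hypothetical.

COMPATIBLE = {
    ("ProcessorA-4K", "Scope-190"),
    ("ProcessorA-4K", "Scope-290"),
    ("ProcessorB-HD", "Scope-190"),
}

MIN_WIDTH, MIN_HEIGHT = 1280, 720  # hypothetical minimum acquisition resolution

def is_supported(processor: str, scope: str, width: int, height: int) -> bool:
    """True only for validated processor/scope pairs that also meet the
    labeled minimum acquisition resolution."""
    return ((processor, scope) in COMPATIBLE
            and width >= MIN_WIDTH and height >= MIN_HEIGHT)

print(is_supported("ProcessorA-4K", "Scope-190", 1920, 1080))  # True
print(is_supported("ProcessorB-HD", "Scope-290", 1920, 1080))  # False: pair not whitelisted
```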
I also wanted to highlight a few categories of AI that I felt were probably most relevant to endoscopy devices. There's computer-aided detection, which people know about, and computer-aided diagnosis, which has already been talked about—where you're trying to differentiate, say, an adenoma from a hyperplastic polyp, or among multiple categories. You could obviously have a combination of those, doing the detection and the diagnosis together. Another category that I think is important to recognize as a possibility: there's a device that was cleared in ultrasound where the goal is not to improve the diagnostic task but to improve the acquisition of the data itself. In ultrasound, it tells the technologist how to adjust the transducer to get the best-quality video, the best-quality snapshot of that image. And of course, that could be very valuable in endoscopy as well: could you have an AI that helps the clinician do a better job of collecting the data they need to make a proper diagnosis or do a proper intervention? So again, that's a category. There are other application areas in radiology that I'm not going to cover here because they're probably not too relevant, but there can be other types of applications as well.

Now I'm going to talk about two components of these assessment studies and give you an idea of what was done, or is being done, in these studies. First, clinical testing. The type of testing we do is what we call a multi-reader, multi-case study, or MRMC study. I want to emphasize that one of the big factors in anything where a clinician is involved in making a determination—whether that's radiology, endoscopy, or ophthalmology—is clinician variability. It's an important factor. Every clinician doesn't perform the same way, and even if two clinicians have the same skill level, they may have a different operating point for deciding which patients to call back and which not. So even at the same skill level, you can have different operating points, and that's a really important factor in these studies.

Why we like these study designs: they generalize to both clinicians and patients; they have greater statistical power for a given number of cases; and, as I'll show you, they can accommodate a wide range of paradigms. I'm not going to go through all the paradigms. In radiology, a lot of times we do what's called a fully crossed design, where a set of readers reads the same set of patients; it's easy to do that retrospectively. In these cases, I'll show you a different design. MRMC studies also come with a large number of well-characterized statistical tools, so we have a good way of analyzing the data, and people can utilize those tools when they come in with a submission.

Here's what we do, at least for these CADe devices: these are typically prospective two-arm studies. Each endoscopist reads a similar number of patients in the without-CAD and with-CAD arms, with different patients for each endoscopist in each arm. What we want to make sure of is that the patient populations are consistent across the arms of the trial. So now we have two different populations in two different arms.
In radiology, again, we typically have the patient as a self-control. That's not the case here. So how do we do this? In this example, we have a set of cases that reader one reads, then a different set of patients that reader two reads—in this case without the aid—and so on across however many readers are in the study. And again, it's important that the clinicians read roughly the same number of cases in this arm. If we don't do that, one or two clinicians can pretty much dominate the whole study, and we're not capturing good variability across the readings. Then the same readers read another set of patients in the second arm, again trying to match the number of reads in each arm. So this is a two-arm trial. It's possible to do something like a tandem study design as well, but that hasn't been used so far.

That's the basic study design. The reference standard here is extraction plus histopathology: all the polyps are extracted, and histopathology confirms adenoma or whatever the pathology is. So if there's a polyp that everyone in the world thinks was there, but no one extracted it, it's not considered a confirmed adenoma. It's not a perfect standard, but it is the standard we utilize.

There are two co-primary endpoints. One is typically something like adenomas per colonoscopy (APC)—some studies have used ADR directly, but the APC endpoint is typically a little easier to do with fewer cases. The other is a control we call positive percent agreement (PPA). This is a measure of the relevance of the extractions: are they adenomas or something else we should be extracting, versus other extractions? Extracting hyperplastic polyps happens in clinical practice, but there's not necessarily value added there. What we want to understand is roughly how many adenomas or important pathologies we're removing versus how much we're paying in additional extractions. There can be flexibility in how much risk is acceptable, but we don't want to pull out all kinds of lesions just to get a few additional adenomas. (A small sketch of these two endpoints follows below.)

Secondary endpoints are things like adenoma miss rate (AMR) if we're doing a tandem study; ADR, obviously, we'd want to see as a secondary endpoint; and then we look at a lot of subgroupings by pathology, polyp size, patient demographics, and other factors in those analyses. In the statistical analysis plan, we would look for superiority in APC and non-inferiority in PPA, and those margins would be predefined. And again, the analysis should account for both reader and case variability—there are different statistical approaches for evaluating these types of studies, and you need to be able to use them. I just want to emphasize again: having a consistent, or relatively close, number of reads across the participating endoscopists is really important, otherwise the study gets dominated by one or two readers. We have some studies that may have 20 endoscopists, but half the readers read only three or four cases while the other half read 150, and those three or four extra reads aren't really helping very much.
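As a rough illustration of the co-primary endpoints described above, here is a minimal Python sketch. This is not FDA's analysis method: the data layout and the non-inferiority margin are hypothetical, and a real analysis would use predefined margins and confidence bounds from MRMC-style methods that account for reader and case variability, not raw point estimates.

```python
# Minimal sketch of the two co-primary endpoints discussed above:
# adenomas per colonoscopy (APC) and positive percent agreement (PPA).
# Data layout and margin are hypothetical; a real submission would use
# MRMC-style analyses accounting for reader and case variability.

def apc(exams):
    """Mean number of histologically confirmed adenomas per colonoscopy."""
    return sum(e["adenomas"] for e in exams) / len(exams)

def ppa(exams):
    """Fraction of extracted polyps confirmed as adenomas: a measure of how
    relevant the extractions are (hyperplastic-only extractions lower it)."""
    extractions = sum(e["extractions"] for e in exams)
    adenomas = sum(e["adenomas"] for e in exams)
    return adenomas / extractions if extractions else 0.0

def coprimary_met(with_ai, without_ai, ni_margin=0.05):
    """Toy decision rule: superiority in APC and non-inferiority in PPA.
    Real margins are predefined per study; real tests use confidence bounds."""
    superior = apc(with_ai) > apc(without_ai)
    non_inferior = ppa(with_ai) >= ppa(without_ai) - ni_margin
    return superior and non_inferior

# Example: each exam records total extractions and confirmed adenomas.
with_ai = [{"extractions": 3, "adenomas": 2}, {"extractions": 1, "adenomas": 1}]
without_ai = [{"extractions": 1, "adenomas": 1}, {"extractions": 1, "adenomas": 0}]
print(coprimary_met(with_ai, without_ai))  # True for this toy data
```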
So again, trying to equalize that is important. And in these two-arm studies, it's really important that the patient populations match, because if they don't match well, it may not be the AI at all that's making the difference—it could just be a difference in the populations, with one population having more adenomas than the other. That could either benefit or hurt the AI, depending on which arm it ends up in. So matching the populations when you have two different patient populations is really important.

The other aspect I'd touch on—not the only important part of non-clinical testing, but one of the really important ones I want to emphasize—is what we call standalone performance testing of the algorithm. The reason we want this is that it benchmarks the algorithm's performance and facilitates generalizability analysis. Clinical studies tend to ask a specific question and answer that question, and maybe we get some additional information by subgrouping or other pieces we can pull out. When we test the algorithm itself against much larger datasets, we can start to understand which parts of the population it works better on and which parts it maybe doesn't work quite so well on. It's not necessarily that the device overall doesn't work, but we'd like clinicians to be able to understand: this is an area where I should be paying more attention; here it performs a little better and I can be a little more confident in the AI. You should always pay attention throughout the whole exam, but the way you use the device may vary based on what you know about how it works.

For the standalone testing, the reference standard is again polyps. Clinicians identify polyp bounding boxes, defined or confirmed by the endoscopist—a technologist may draw them, but the endoscopist approves them—and only histologically confirmed adenomas or polyps are marked as truth. A lot of times companies will come in and say, I have these snapshots or these segments, and I'm going to show you the performance on those segments. What we really want is the full exam, especially for CADe, because you're marking locations and we want to know how many false marks you're going to be putting up in that patient—that's what's going to matter as you do the exam. If there are thousands of locations you need to look at, you're going to tend to ignore some of them and not pay attention. If there are three or four of them, along with the adenomas or the polyps, that system is probably going to work a lot better. So we need a way of measuring that, and we need to look at it across the whole exam.

Another important point is that acquisition matters here, so having in that dataset the range of scopes and processors you're going to use this device with is really important. We're also interested in the relevant patient demographics and disease populations: what type of disease are you looking at, what are the sizes, what's the ratio of adenomas, how many carcinomas may be within that population, what are the patient demographics? At a minimum, we want to be able to report that and look at subgroup performance within it.
And again, for the subgroups we're not always looking for statistically significant differences, but we at least want reasonably sized confidence intervals within them. The two areas we look at are object-based true-positive and false-positive rates and frame-based true-positive and false-positive rates.

Frame-based is relatively straightforward. Here's just a toy example: the green box is a true polyp, and the others are false marks. On the left-hand side you might have one true positive and one false positive; then zero true positives and one false positive; and so on, as you see down there. You do this per frame. There is correlation, because you're looking at the same polyp through multiple frames, and you need to account for that in your statistics.

Object-based is a little more complicated for endoscopy. Here I show examples—now, if this video works—where you're on the polyp, then off the polyp. That's sort of one detection: you're looking at it over a span of time and trying to see when you see that polyp. But there's a whole bunch of other marks coming up, and we need a way of counting those marks. We do this by looking at two factors. One is how much the boxes overlap the true lesion. The second is the persistence of the marks: how long are the marks on the object? You could have a polyp that's on the screen for three, four, or five seconds but a mark that's on it for only a single frame. Is that good enough? Is that enough of a detection to draw the clinician's attention to it? It's not clear that it is. So we look at persistence: how long the markers are on, from when they start to when it may be relevant for the clinician to identify the lesion. Having the mark on for ten seconds may not matter so much, because the clinician will identify the lesion anyway, but benchmarking the algorithm's performance on these factors helps us compare algorithms and understand how they're performing. In this example, depending on how you define false positives, you have somewhere between four and eight false positives in this particular short clip.

What we look at for object-based performance, then, is true-positive rates, false-positive rates, and free-response ROC curves, using the persistence and overlap I talked about. Frame-based does the same thing, except we may look at ROC performance. Then we look at subgroups, and again we want to account for the correlations within these images. (A small sketch of this frame- and object-based scoring follows below.)

So, to summarize quickly: I talked about class 1 to class 3 devices, then about the current endoscopy AI on the market, and then a little about assessment. Thank you.
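To make the frame- and object-based scoring described in the talk concrete, here is a minimal Python sketch over full-exam video annotations. The IoU cutoff and the minimum-persistence threshold are hypothetical illustration values, not FDA-specified criteria, and the simple run-length rule stands in for whatever persistence definition a real protocol would predefine.

```python
# Minimal sketch of frame-based TP/FP counting and an object-based
# persistence-plus-overlap detection rule, as described above. The IoU
# cutoff and minimum persistence are hypothetical illustration values.

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def frame_counts(frames, iou_cut=0.3):
    """Frame-based scoring. frames = list of (truth_boxes, ai_marks);
    each AI mark is a TP if it overlaps any truth box, else a FP.
    (The statistics must still account for correlation across frames of
    the same polyp, as noted in the talk.)"""
    tp = fp = 0
    for truth_boxes, ai_marks in frames:
        for mark in ai_marks:
            if any(iou(mark, t) >= iou_cut for t in truth_boxes):
                tp += 1
            else:
                fp += 1
    return tp, fp

def object_detected(track, iou_cut=0.3, min_persist=10):
    """Object-based rule: the polyp counts as detected only if a mark
    overlaps it for at least min_persist consecutive frames while it is
    visible. track = list of (truth_box_or_None, mark_box_or_None)."""
    run = best = 0
    for truth, mark in track:
        hit = (truth is not None and mark is not None
               and iou(mark, truth) >= iou_cut)
        run = run + 1 if hit else 0
        best = max(best, run)
    return best >= min_persist
```

Under this toy rule, a one-frame flash on a polyp that stays visible for several seconds would fail the `min_persist` check, matching the talk's concern that a single-frame mark may not be enough of a detection to draw the clinician's attention.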
Video Summary
Dr. Nicholas Petrick from the FDA discusses the regulations for platform-based suites of AI technology. He explains that medical devices are categorized from class 1 to class 3, with regulatory controls increasing from class 1 to class 3. Class 1 devices are low risk and exempt from premarket data requirements, while class 2 devices, such as endoscopes, typically reach the market through a 510(k) pre-market notification demonstrating substantial equivalence to a predicate device, or through the de novo pathway. Class 3 devices need to demonstrate reasonable assurance of safety and effectiveness. Dr. Petrick describes the gastrointestinal lesion software detection system (product code QNP), an AI device category in endoscopy established through the de novo process. He also discusses the importance of clinical testing and standalone algorithm performance testing in assessing AI algorithms, emphasizing the need for consistent numbers of cases across clinicians and matched patient populations in two-arm studies. He highlights the importance of benchmarking algorithm performance, accounting for variations in image acquisition, and understanding the clinical relevance and usability of AI devices for endoscopy.
Asset Subtitle
Nicholas Petrick, PhD
Keywords
FDA regulations
platform-based suites
AI technology
medical devices
class categorization